Abstract In order to conduct a statistical analysis on a given set of phylogenetic gene trees, we often use a distance measure between two trees. In a statistical distance-based method to analyze discordance between gene trees, it is key to choose a “biologically meaningful” and “statistically well-distributed” distance between trees. Thus, in this paper, we study the distributions of three tree distance metrics between two trees: the edge difference, the path difference, and the precise K interval cospeciation distance. First, we focus on the distributions of the three tree distances between two random unrooted trees with n leaves (n > 4); then we focus on the distributions of the three tree distances between a fixed rooted species tree with n leaves and a random gene tree with n leaves generated under the coalescent process with the given species tree. We show some theoretical results as well as a simulation study on these distributions.

Key Words: Coalescent, Phylogenetics, Tree metrics, Tree topologies.


1 Introduction 


A central issue in systematic biology is the reconstruction of populations and species from numerous gene trees with varying levels of discordance (Brito and Edwards 2009; Edwards 2009). While there has been a well-established understanding of the discordant phylogenetic relationships that can exist among independent gene trees drawn from a common species tree (Pamilo and Nei 1988; Takahata 1989; Maddison 1997; Bollback and Huelsenbeck 2009), phylogenetic studies have only recently begun to shift away from single gene or concatenated gene estimates of phylogeny towards these multi-locus approaches (e.g. Carling and Brumfield 2008; Yu et al. 2011; Betancur et al. 2013; Heled and Drummond 2011; Thompson and Kubatko 2013). In order to conduct a statistical analysis on the given set of gene trees, we vectorize each tree, i.e., we convert it into a numerical vector based on a distance matrix or dissimilarity map. These vectorized trees can then be analyzed as points in a multi-dimensional space where the distance between trees increases as they become more dissimilar (Hillis et al. 2005; Semple and Steel 2003; Graham and Kennedy 2010). Statistical applications that test for incongruence or congruence between two trees using a measurement of dissimilarity between a pair of trees are called distance-based methods (for example, Holmes (2007), Arnaoudova et al. (2010) and Weyenberg et al. (2014) are such statistical methods). In a statistical distance-based method to analyze discordance between gene trees, it is key to choose a “biologically meaningful” and “statistically well-distributed” distance between trees (Steel and Penny 1993; Coons and Rusinko 2014). Therefore we have studied the distributions of some well-known tree distances between trees.


In this paper we focus on three topological tree distances: the edge difference distance (Williams and Clifford 1971), the precise k-Interval Cospeciation (K-IC) distance (Huggins et al. 2012), and the path difference distance (Steel and Penny 1993). The distributions of the Robinson-Foulds (RF) distance (Robinson and Foulds 1981) and the quartet distance (Brodal et al. 2001) between random trees are already very well studied (for example, Steel and Penny (1993)).

Here we have conducted simulation studies on these distributions and we have shown theoretical results on the distributions of these tree distances between the species tree and gene trees which are generated under the coalescent process (Degnan and Salter 2005a).


For the precise K-IC distance between two random trees, Coons and Rusinko (2014) showed that if we take two random trees, compute the distance between them, and send the number of leaves n of the trees to infinity, then the probability that the distance between the two random trees attains the worst possible value, that is (n - 3), goes to zero, while the probability that the RF distance between two random trees attains its worst possible value, that is 2n - 6, goes to one (Theorem 8 in Coons and Rusinko (2014)). This property is very important when applying statistical analysis to the distances between trees. In addition, Steel and Penny (1993) presented a simulation study as well as some theoretical results on the distributions of the RF distance, quartet distance and path difference distance between random trees with n = 12 leaves (see Figure 6 in Steel and Penny (1993)). A key ingredient in analyzing the distributions of these three tree distances between two random trees with n leaves is the simple observation that the precise K-IC distance between trees is the $\ell_\infty$ norm of the difference of two vectorized trees, the path difference distance is the $\ell_2$ norm of the difference of two vectorized trees, and the edge difference distance is the $\ell_1$ norm of the difference of two vectorized trees. First, in this paper, we will show some theoretical results comparing the distributions of these tree distances between random trees with n leaves.

A coalescent process is often used to model gene trees given a fixed species tree with n leaves. These theoretical developments have been used to reconstruct species trees from samples of estimated gene trees in practice (Maddison and Knowles 2006; Carstens and Knowles 2007; Edwards et al. 2007; Mossel and Roch 2010; RoyChoudhury et al. 2008). Rosenberg (2002) studied the distribution of the topological concordance of gene trees and species trees under the coalescent process, Rosenberg (2003) worked on the distributions of monophyly, paraphyly, and polyphyly in a coalescent model, and Degnan and Salter (2005b) studied the distribution of gene trees under the coalescent process. In this paper we focus on the distributions of the edge difference, path difference, and precise K-IC distances between the fixed species tree and gene trees generated under the coalescent process.

This paper is organized as follows. In Section 2 we remind readers of some definitions. In Section 3 we focus on the distributions of these three tree distances between two unrooted random trees. More specifically, in Subsection 3.1 we will show the variance of the distribution of the path difference distance between two random trees with n leaves. In Subsections 3.2 and 3.3 we will compare the means of the distributions of the edge difference and precise K-IC distances between random trees with the mean of the distribution of the path difference distance between them. In Section 4 we focus on the distributions of these three different tree distances between a fixed species tree and a gene tree generated from the coalescent process with the species tree. In particular, we have computed explicitly the probability that any of the three tree distances between a fixed species tree and a gene tree generated under the coalescent process equals zero. In Section 5 we have shown several simulation studies on the distributions of the three different tree distances between random trees as well as between a fixed species tree and a gene tree generated from the coalescent. We end with discussions in Section 6.


2 Basics and notation 

In the subsequent descriptions, let $n$ be the number of leaves (terminal taxa) in a tree. Let $\mathcal{T}_n$ be the space of all possible unrooted trees on $n$ taxa and let $\mathcal{T}_n^r$ be the space of all possible rooted trees on $n$ taxa. In this paper we consider only tree metrics between two trees using topological information of the trees, i.e., this tree space does not incorporate branch length information. We use $\|\cdot\|_p$ to represent the usual $\ell_p$ norm of a vector, and $|\cdot|$ to indicate the cardinality of a set. A tree distance is a function $d : \mathcal{T}_n \times \mathcal{T}_n \to \mathbb{R}$ that has, at a minimum, the properties $d(r, s) = d(s, r)$ and $d(t, t) = 0$. Many of the methods also require a vectorization function $v : \mathcal{T}_n \to \mathbb{R}^m$ for some $m$, which maps phylogenetic trees into Euclidean space. The symmetric difference between two sets is defined as $A \triangle B := (A \setminus B) \cup (B \setminus A)$.

Fig. 1: Example phylogenetic trees: $T_1$ and $T_2$. The trees represent proposed most recent common ancestor relationships between 5 taxa, labeled a through e. These trees have branch lengths specified, but not all trees need have such information.

Several popular tree distances are squared Euclidean distances as will be demonstrated below. 

The dissimilarity map or distance matrix of a tree $T$ is an $n \times n$ symmetric matrix of non-negative real numbers, with zero diagonal and off-diagonal elements corresponding to the sum of the branch lengths between pairs of leaves in the tree.

Suppose $v : \mathcal{T}_n \to \mathbb{R}^{\binom{n}{2}}$ is a function such that the $(i,j)$th coordinate of $v(T)$, where $1 \le i < j \le n$, is the number of edges on the unique path between leaves $i$ and $j$ on $T$.


2.1 Path difference 


The RF distance is completely determined by the topologies of the trees, ignoring any edge lengths that may be present. Conversely, the dissimilarity map distance requires that the edge lengths be defined. The path difference distance $d_p$ is a distance analogous to the dissimilarity map, but which does not require edge length information.

The calculation of the path difference is identical to that of the dissimilarity map, except that elements in the distance matrix $D(T)$ are determined by counting the number of edges between the leaves, rather than summing the edge lengths. (This is equivalent to the dissimilarity map distance with all edge lengths in the tree set equal to 1.) The path difference is studied and compared with the RF distance by Steel and Penny (1993).


Using the lexicographical ordering of the coordinates of the vector, we find that the path difference vectorizations of our example trees are

$$v(T_1) = (2,3,4,4,3,4,4,3,3,2),$$

$$v(T_2) = (2,4,4,3,4,4,3,2,3,3).$$

The path difference is therefore $d_p(T_1,T_2) = \|v(T_1) - v(T_2)\|_2 = \sqrt{6}$.


2.2 Edge difference 


This tree metric between two trees is defined by Williams and Clifford (1971). Suppose we have two trees $T_1, T_2 \in \mathcal{T}_n$. Then the edge difference $d_e$ is the distance measure between $T_1$ and $T_2$ such that

$$d_e(T_1,T_2) = \|v(T_1) - v(T_2)\|_1.$$

The edge difference vectorization of any tree is exactly the same as the path difference vectorization of the tree. The edge difference for our example trees is therefore $d_e(T_1,T_2) = \|v(T_1) - v(T_2)\|_1 = 6$.
2.3 Precise k-interval cospeciation


The precise k-interval cospeciation (K-IC) distance $d_k$ is also a distance analogous to the path difference distance, but it uses the $\ell_\infty$ norm instead of the $\ell_2$ norm. This tree metric was defined by Huggins et al. (2012).


The precise K-IC vectorization of any tree is exactly the same as the path difference vectorization of the tree. The precise K-IC distance for our example trees is therefore $d_k(T_1,T_2) = \|v(T_1) - v(T_2)\|_\infty = 1$.
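The three distances of this section can be reproduced with a short Python sketch. The edge lists below are an assumption: they are the two 5-leaf topologies whose path vectors equal the $v(T_1)$ and $v(T_2)$ given in Subsection 2.1, and the internal node names `x`, `y`, `z` as well as all function names are hypothetical.

```python
from collections import deque
from itertools import combinations
import math

def build_adjacency(edges):
    """Turn an undirected edge list into an adjacency dictionary."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    return adj

def edge_counts_from(adj, source):
    """Breadth-first search: number of edges from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_vector(edges, leaves):
    """Vectorization v(T): path lengths for leaf pairs in lexicographic order."""
    adj = build_adjacency(edges)
    dist = {leaf: edge_counts_from(adj, leaf) for leaf in leaves}
    return [dist[i][j] for i, j in combinations(sorted(leaves), 2)]

def tree_distances(v1, v2):
    """Return (d_e, d_p, d_k): the l1, l2 and l-infinity norms of v1 - v2."""
    diff = [abs(p - q) for p, q in zip(v1, v2)]
    return sum(diff), math.sqrt(sum(x * x for x in diff)), max(diff)

# Assumed topologies (inferred from the vectors in the text); x, y, z are
# hypothetical labels for the internal nodes.
T1_EDGES = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
            ("y", "z"), ("d", "z"), ("e", "z")]
T2_EDGES = [("a", "x"), ("b", "x"), ("x", "y"), ("e", "y"),
            ("y", "z"), ("c", "z"), ("d", "z")]
LEAVES = ["a", "b", "c", "d", "e"]

v1 = path_vector(T1_EDGES, LEAVES)
v2 = path_vector(T2_EDGES, LEAVES)
d_e, d_p, d_k = tree_distances(v1, v2)
print(v1, v2)
print(d_e, d_p, d_k)
```

Each breadth-first search visits the $2n - 2$ nodes of a binary tree once, so building the whole vector from $n$ searches takes a number of steps quadratic in $n$.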

Using the definitions of the tree differences $d_e$, $d_p$, $d_k$ between any two trees $T_1, T_2 \in \mathcal{T}_n$, we immediately have the following remarks.


Remark 1
- The tree differences $d_e$, $d_p$, $d_k$ between any two trees $T_1, T_2 \in \mathcal{T}_n$ are tree metrics.
- The tree differences $d_e$, $d_p$, $d_k$ between any two trees $T_1, T_2 \in \mathcal{T}_n$ can be computed in $O(n^2)$.
- Many tree metrics, such as the Nearest-Neighbor-Interchange distance, the Subtree-Prune-and-Regraft distance, and the Tree-Bisection-and-Regrafting distance, are NP-hard to compute (DasGupta et al. 1997; Hickey et al. 2008; Allen and Steel 2001).


3 Distributions of the three tree metrics between unrooted random trees 

In this section we focus on the distributions of the path difference, edge difference and precise K-IC distances between unrooted random trees from $\mathcal{T}_n$.


3.1 Distribution of path difference metric between two trees

Suppose we sample trees from the uniform distribution over $\mathcal{T}_n$. In this section we consider the distribution of the path difference tree metric $d_p$ between two random trees sampled uniformly from $\mathcal{T}_n$.

Let $b(n)$ denote the number of unrooted binary trees with $n$ labeled leaves, that is, $b(n) = (2n-5)!!$. Then we have the following theorems.


Theorem 1 (Theorem 3 from Steel and Penny (1993)) Consider the distribution of $d_p$ under the uniform distribution over $\mathcal{T}_n$. Let $d_{ij}(T)$ for $T \in \mathcal{T}_n$ be the number of edges on the unique path between a leaf $i$ and a leaf $j$. Then,

$$E[d_{ij}(T)] = \alpha(n),$$

$$V[d_{ij}(T)] = 4n - 6 - \alpha(n) - \alpha^2(n),$$

where $\alpha(n+2) = \dfrac{2^{2n}}{\binom{2n}{n}}$, and

$$\mu_p(n) = 2\binom{n}{2} V[d_{ij}(T)], \qquad (2)$$

where $\mu_p(n)$ is the expected value of $d_p^2$ under the uniform distribution over $\mathcal{T}_n$.
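As a sanity check of Theorem 1 for a small case, the following Python sketch enumerates all unrooted binary trees with $n = 5$ leaves and compares the exact averages with $\alpha(n)$ and $\mu_p(n)$. It assumes the closed form $\alpha(m+2) = 2^{2m}/\binom{2m}{m}$ and builds the trees by attaching each new leaf to every edge in turn; the function names and the negative integer labels for internal nodes are hypothetical.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations
from math import comb

def alpha(n):
    """alpha(n), using alpha(m + 2) = 2^(2m) / C(2m, m)."""
    m = n - 2
    return Fraction(4 ** m, comb(2 * m, m))

def var_dij(n):
    """V[d_ij(T)] = 4n - 6 - alpha(n) - alpha(n)^2."""
    a = alpha(n)
    return 4 * n - 6 - a - a * a

def mu_p(n):
    """mu_p(n) = 2 * C(n, 2) * V[d_ij(T)]."""
    return 2 * comb(n, 2) * var_dij(n)

def all_unrooted_binary_trees(n):
    """Edge lists of all unrooted binary trees with leaves 1..n (n >= 3)."""
    trees = [[(1, -1), (2, -1), (3, -1)]]
    for leaf in range(4, n + 1):
        node = -(leaf - 2)  # fresh internal node for the new leaf
        grown = []
        for edges in trees:
            for idx, (u, w) in enumerate(edges):
                rest = edges[:idx] + edges[idx + 1:]
                grown.append(rest + [(u, node), (node, w), (leaf, node)])
        trees = grown
    return trees

def path_vector(edges, n):
    """Number of edges between leaves i < j, in lexicographic order."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    vec = []
    for i, j in combinations(range(1, n + 1), 2):
        dist = {i: 0}
        queue = deque([i])
        while j not in dist:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        vec.append(dist[j])
    return vec

n = 5
vectors = [path_vector(t, n) for t in all_unrooted_binary_trees(n)]
b_n = len(vectors)
# Exact mean of d_12 over all trees (coordinate 0 is the pair (1, 2)).
mean_d12 = Fraction(sum(v[0] for v in vectors), b_n)
# Exact mean of d_p^2 over all ordered pairs of trees.
mean_sq_dist = Fraction(
    sum(sum((p - q) ** 2 for p, q in zip(v, w)) for v in vectors for w in vectors),
    b_n ** 2,
)
print(b_n, mean_d12, alpha(n))
print(mean_sq_dist, mu_p(n))
```

Exact rational arithmetic via `fractions.Fraction` avoids any rounding question when comparing the enumeration with the closed forms.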


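Proof: In this paper we only show the proof for $\mu_p(n)$. For the rest of the proof of this theorem, see Steel and Penny (1993). By definition of $d_p^2$ we have

$$d_p^2(T,T') = \|d(T) - d(T')\|_2^2 = \sum_{i<j} [d_{ij}(T) - d_{ij}(T')]^2,$$

where $T$ and $T'$ are two random binary trees. So the mean is

$$\begin{aligned}
\mu_p(n) &= E[d_p^2(T,T')] = \sum_{T,T'} \Pr(T)\Pr(T')\, d_p^2(T,T') \\
&= \frac{1}{b(n)^2} \sum_{T,T'} \sum_{i<j} [d_{ij}(T) - d_{ij}(T')]^2 \\
&= \frac{1}{b(n)^2} \sum_{T,T'} \sum_{i<j} \left[ d_{ij}(T)^2 + d_{ij}(T')^2 - 2 d_{ij}(T) d_{ij}(T') \right] \\
&= \frac{1}{b(n)^2} \sum_{i<j} \left[ \sum_{T,T'} d_{ij}(T)^2 + \sum_{T,T'} d_{ij}(T')^2 - 2 \sum_{T,T'} d_{ij}(T) d_{ij}(T') \right] \\
&= \frac{1}{b(n)^2} \sum_{i<j} \left[ \sum_{T'} \Big( \sum_{T} d_{ij}(T)^2 \Big) + \sum_{T} \Big( \sum_{T'} d_{ij}(T')^2 \Big) - 2 \sum_{T} d_{ij}(T) \Big( \sum_{T'} d_{ij}(T') \Big) \right] \\
&= \frac{1}{b(n)^2} \sum_{i<j} \left[ 2 b(n) \sum_{T} d_{ij}(T)^2 - 2 \Big( \sum_{T} d_{ij}(T) \Big)^2 \right].
\end{aligned}$$

Notice that $\sum_T f(d_{ij}(T))$ does not depend on the selection of $i$ and $j$ because of the symmetry of labeling (it is easy to prove by contradiction and switching the labels). Therefore $\sum_T f(d_{ij}(T)) = \sum_T f(d_{kl}(T))$ with $i < j$, $k < l$, and thus we have

$$\begin{aligned}
\mu_p(n) &= \frac{2\binom{n}{2}}{b(n)^2} \left[ b(n) \sum_{T} d_{ij}(T)^2 - \Big( \sum_{T} d_{ij}(T) \Big)^2 \right] \\
&= 2\binom{n}{2} \left[ \frac{\sum_T d_{ij}(T)^2}{b(n)} - \Big( \frac{\sum_T d_{ij}(T)}{b(n)} \Big)^2 \right] \\
&= 2\binom{n}{2} \left( E[d_{ij}(T)^2] - E[d_{ij}(T)]^2 \right) = 2\binom{n}{2} \mathrm{Var}(d_{ij}(T))
\end{aligned}$$

with any selection of $i$ and $j$.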


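Theorem 2 $\sigma_p(n)$, the variance of $d_p^2$, is

$$\begin{aligned}
\sigma_p(n) = \frac{1}{b(n)^2} \Bigg\{ & \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big]^2 + \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T')^2 \Big]^2 + 4 \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T) d_{ij}(T') \Big]^2 \\
& + 2 \Big\{ \binom{n}{2} b(n) [4n - 6 - \alpha(n)] \Big\}^2 - 8 b(n) \alpha(n) \sum_{T} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T) \Big] \Bigg\} \\
& - 4 \Big[ \binom{n}{2} V[d_{ij}(T)] \Big]^2 .
\end{aligned}$$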


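Proof: Since $\sigma_p(n) = \mathrm{Var}(d_p^2) = E[d_p^4] - \mu_p(n)^2$, where the explicit formula of $\mu_p(n)$ is known, we only have to consider

$$\begin{aligned}
E[d_p^4(T,T')] &= \sum_{T,T'} \Pr(T)\Pr(T') [d_p^2(T,T')]^2 \\
&= \frac{1}{b(n)^2} \sum_{T,T'} \Big( \sum_{i<j} [d_{ij}(T) - d_{ij}(T')]^2 \Big)^2 \\
&= \frac{1}{b(n)^2} \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 + \sum_{i<j} d_{ij}(T')^2 - 2 \sum_{i<j} d_{ij}(T) d_{ij}(T') \Big]^2 \\
&= \frac{1}{b(n)^2} \Bigg\{ \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big]^2 + \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T')^2 \Big]^2 + 4 \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T) d_{ij}(T') \Big]^2 \\
&\qquad + 2 \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T')^2 \Big] - 4 \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T) d_{ij}(T') \Big] \\
&\qquad - 4 \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T')^2 \Big] \Big[ \sum_{i<j} d_{ij}(T) d_{ij}(T') \Big] \Bigg\} \\
&= \frac{1}{b(n)^2} \Bigg\{ \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big]^2 + \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T')^2 \Big]^2 + 4 \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T) d_{ij}(T') \Big]^2 \\
&\qquad + 2 \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T')^2 \Big] - 8 \sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T) d_{ij}(T') \Big] \Bigg\}.
\end{aligned}$$

In this equation, two terms can be simplified as:

$$\begin{aligned}
\sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T')^2 \Big]
&= \Big[ \sum_{T} \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{T'} \sum_{i<j} d_{ij}(T')^2 \Big] \\
&= \Big\{ \binom{n}{2} b(n) E[d_{ij}(T)^2] \Big\}^2 \\
&= \Big\{ \binom{n}{2} b(n) [4n - 6 - \alpha(n)] \Big\}^2,
\end{aligned}$$

$$\begin{aligned}
\sum_{T,T'} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T) d_{ij}(T') \Big]
&= \sum_{T} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T) \sum_{T'} d_{ij}(T') \Big] \\
&= \sum_{T} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T)\, b(n) E[d_{ij}(T')] \Big] \\
&= b(n) \alpha(n) \sum_{T} \Big[ \sum_{i<j} d_{ij}(T)^2 \Big] \Big[ \sum_{i<j} d_{ij}(T) \Big].
\end{aligned}$$

□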


3.2 Distribution of the edge difference metric between two trees 


Theorem 3 Consider the distribution of $d_e$ under the uniform distribution over $\mathcal{T}_n$. Then, using the relation $\|x\|_p \le \|x\|_q$ between the $\ell_p$ and $\ell_q$ norms for $0 < q < p$, we have:



(4n — 6 — a{n) — a^{n)) < jXe{n) < 



(3) 


where $\mu_e(n)$ is the expected value of $d_e$ under the uniform distribution over $\mathcal{T}_n$.

Remark 2 Let $B(x) = \sum_{n \ge 1} b(n+1) \frac{x^n}{n!}$ be the exponential generating function for the number of planted binary trees, $b(n+1)$, with $n$ labeled non-root leaves (or the number of rooted binary trees with $n$ leaves). Let

$$F(x, y) = yB(x) + y^2 B(x)^2 + \cdots = \frac{1}{1 - yB(x)} - 1$$

be the exponential generating function for the number of ordered forests consisting of a given number of rooted trees (marked by $y$) and a given number of leaves (marked by $x$). Then for a fixed pair of distinct leaves $i$ and $j$ (we can set $i = 1$ and $j = 2$), we have

$$\sum_{T \in \mathcal{T}_n} \sum_{T' \in \mathcal{T}_n} |d_{ij}(T) - d_{ij}(T')| = \sum_{r=2}^{n-1} [y^{r}][x^{n-2}]\, yF(x,y) \left( \sum_{r'=2}^{n-1} |r - r'|\, [y^{r'}][x^{n-2}]\, yF(x,y) \right),$$

where $[x^k][y^{k'}] f(x,y)$ denotes the coefficient of $x^k \cdot y^{k'}$ in the function $f(x,y)$.


3.3 Distribution of the precise K-IC tree metric between two trees


Now we consider the distribution of $d_k$ under the uniform distribution over $\mathcal{T}_n$. Using the relation $\|x\|_p \le \|x\|_q$ between the $\ell_p$ and $\ell_q$ norms for $0 < q < p$, we have the following theorem:

Theorem 4 Consider the distribution of $d_k$ under the uniform distribution over $\mathcal{T}_n$. Then,


^2 (4n — 6 — a{n) — a^{n)) < fikixi) < 



(4n — 6 — a{n) — a^{n)) 


( 4 ) 


where $\mu_k(n)$ is the expected value of $d_k$ under the uniform distribution over $\mathcal{T}_n$.

Remark 3 Using the same relation as above, we can use $\mu_k(n)$ as an upper bound for $\mu_p(n)$ and $\mu_e(n)$, that is,

$$\mu_p(n) \le \sqrt{\binom{n}{2}}\, \mu_k(n),$$

$$\mu_e(n) \le \binom{n}{2}\, \mu_k(n).$$
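The norm relations behind Remark 3 can be illustrated numerically. The sketch below uses the two example vectorizations from Subsection 2.1 and checks the standard chain $\|x\|_\infty \le \|x\|_2 \le \|x\|_1 \le \sqrt{m}\,\|x\|_2 \le m\,\|x\|_\infty$ for a vector of length $m = \binom{n}{2}$; the helper name `norms` is hypothetical.

```python
import math

def norms(x):
    """Return the l1, l2 and l-infinity norms of a vector."""
    l1 = sum(abs(t) for t in x)
    l2 = math.sqrt(sum(t * t for t in x))
    linf = max(abs(t) for t in x)
    return l1, l2, linf

# Path vectors of the example trees T1 and T2 from Subsection 2.1.
v1 = [2, 3, 4, 4, 3, 4, 4, 3, 3, 2]
v2 = [2, 4, 4, 3, 4, 4, 3, 2, 3, 3]
x = [p - q for p, q in zip(v1, v2)]
m = len(x)  # m = C(5, 2) = 10 coordinates for n = 5 leaves

d_e, d_p, d_k = norms(x)
# d_k <= d_p <= d_e, and both d_p and d_e are bounded by multiples of d_k.
chain_ok = d_k <= d_p <= d_e <= math.sqrt(m) * d_p <= m * d_k
print(d_e, d_p, d_k, chain_ok)
```

Because the inequalities hold for every pair of trees, they carry over to expectations of the distances themselves.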
Fig. 2: Example phylogenetic rooted trees: $T_1$ and $T_2$. The trees represent proposed most recent common ancestor relationships between 3 taxa, labeled a through c.


4 Species tree and gene tree under the coalescent 

Let $\mathcal{T}_n^r$ be the space of rooted trees with $n$ leaves. Note that $\mathcal{T}_n^r = \mathcal{T}_{n+1}$. In this section we consider the distances between a species tree and a gene tree under the coalescent given the species tree. First we consider the following two lemmas from Coons and Rusinko (2014).


Lemma 1 (Lemma 1 from Coons and Rusinko (2014)) For any two trees $T_1, T_2 \in \mathcal{T}_n^r$, $d_k(T_1, T_2) \le n - 2$.

A caterpillar tree is any unrooted binary phylogenetic tree which reduces to a path if we delete all leaves and all edges attached to a leaf (see Figure 3 for an example).

Lemma 2 (Corollary 1 from Coons and Rusinko (2014)) If $d_k(T_1, T_2) = n - 2$ for $T_1, T_2 \in \mathcal{T}_n^r$, then $T_1$ or $T_2$ is a caterpillar tree.


Coons and Rusinko (2014) considered unrooted trees in $\mathcal{T}_n$, for which the bound in Lemma 1 and Lemma 2 is $n - 3$. In this section we consider $\mathcal{T}_n^r$, the space of rooted trees, and since $\mathcal{T}_n^r = \mathcal{T}_{n+1}$ the bound becomes $(n + 1) - 3 = n - 2$. For example, if we consider $T_1$ and $T_2$ in $\mathcal{T}_3^r$ as seen in Figure 2, then $d_k(T_1, T_2) = \|(2,3,3) - (3,2,3)\|_\infty = 3 - 2 = 1$.

Thus, a caterpillar tree is a special case, so we assume that the species tree $T_s \in \mathcal{T}_n^r$ is a caterpillar tree. In this section we also assume that one individual is sampled from each species and that each species has the same effective population size $N_e$. Let $t_i$ be the time interval, in coalescent time units, between the $(i-1)$th event when two species are coalesced and the $i$th event when two species are coalesced (see Figure 3).

Let $T_s \in \mathcal{T}_n^r$ be a caterpillar tree. Now we consider the probability that $T_s$ and a gene tree $T_g$ generated by the coalescent given the species tree $T_s$ have the same tree topology.

Let $g_{ij}(t)$ be the probability that $i$ lineages derive from $j$ lineages that existed $t > 0$ coalescent time units in the past, such that

$$g_{ij}(t) = \sum_{k=j}^{i} \exp\!\left( -\frac{k(k-1)t}{2} \right) \frac{(2k-1)(-1)^{k-j}\, j_{(k-1)}\, i_{[k]}}{j!\,(k-j)!\, i_{(k)}},$$

where $a_{(k)} = a(a+1)\cdots(a+k-1)$ for $k \ge 1$ with $a_{(0)} = 1$, and $a_{[k]} = a(a-1)\cdots(a-k+1)$ for $k \ge 1$ with $a_{[0]} = 1$ (Takahata 1989; Takahata and Nei 1990; Tavaré 1984). $g_{ij}(t) = 0$ except when $1 \le j \le i$.


Remark ^ If t is a scale of coalescent time units then t can be written as t = ^ where t' is the number of 
generation and is a population size. We assume that the size of an ancestral species is the sum of the sizes of 
its descendants so that the scaling of time would be different before and after the divergence of the ancestor, i.e., 
before diverging the scale of coalescent time unit would be t = 7 ^ and after diverging it would be t = 

Remark 5 In fact, we can simplify $g_{21}(t_i)$ for some coalescent time interval $t_i > 0$, and it can be written as

$$g_{21}(t_i) = 1 - \exp(-t_i).$$
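The simplification in Remark 5 can be checked against the general sum. The sketch below assumes the standard form of $g_{ij}(t)$ written with the rising and falling factorials $a_{(k)}$ and $a_{[k]}$ defined above (Tavaré 1984); the function names are hypothetical.

```python
import math

def rising(a, k):
    """a_(k) = a (a + 1) ... (a + k - 1), with a_(0) = 1."""
    out = 1
    for s in range(k):
        out *= a + s
    return out

def falling(a, k):
    """a_[k] = a (a - 1) ... (a - k + 1), with a_[0] = 1."""
    out = 1
    for s in range(k):
        out *= a - s
    return out

def g(i, j, t):
    """Probability that i lineages coalesce to j lineages within time t."""
    if not 1 <= j <= i:
        return 0.0
    total = 0.0
    for k in range(j, i + 1):
        coeff = ((2 * k - 1) * (-1) ** (k - j) * rising(j, k - 1) * falling(i, k)
                 / (math.factorial(j) * math.factorial(k - j) * rising(i, k)))
        total += math.exp(-k * (k - 1) * t / 2) * coeff
    return total

t = 0.7
print(g(2, 1, t), 1 - math.exp(-t))          # Remark 5
print(sum(g(5, j, t) for j in range(1, 6)))  # probabilities over j sum to one
```

For a fixed number $i$ of starting lineages the values $g_{ij}(t)$, $j = 1, \ldots, i$, form a probability distribution, which gives a second check besides Remark 5.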
Before we give the probability that any of these three distances between the caterpillar species tree and gene trees generated from the coalescent process equals zero, we have to define some notation.


To consider this problem, we need to count the number of cases of $M \in \mathbb{N}$ branches with $N \in \mathbb{N}$ lineages in total. Let $C_{N,M}$ be the number of cases in which $N$ lineages coalesce to $M$ lineages. We call the number of lineages in a specific branch the “branch degree”. Obviously, the answer depends on whether we consider the order among branches with the same branch degree. If we consider the two figures in Figure 4 as different cases, then it is not very difficult to obtain that

$$C_{N,M} = \prod_{k=M+1}^{N} \binom{k}{2}.$$

However, it is more complicated if we consider them as the same case. We need to first enumerate all possible ordered $M$ branch degrees (the numbers of lineages that coalesce in each branch), then sum up the number of cases for each ordered branch degrees. For example, when $N = 5$ and $M = 3$, we have two possible ordered branch degrees (113) and (122); since we have $\binom{5}{3} \cdot (2 \cdot 3 - 3)!! = 30$ cases for (113), and $\binom{5}{2}\binom{3}{2}/2 = 15$ cases for (122), we have 45 cases in total.




(a) 12 happens after 34 (b) 34 happens after 12 

Fig. 4: 4 lineages coalesce to 2 lineages with the same topology 12 | 34 


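Define $\mathcal{D}_{M,N} = \{ (w_1, \ldots, w_M) \in \mathbb{N}^M : \sum_{i=1}^{M} w_i = N,\ w_1 \le w_2 \le \cdots \le w_M \}$ as the set of all possible ordered branch degrees. It is trivial to prove that we can enumerate all elements in $\mathcal{D}_{M,N}$ without duplication in the following way:

$$\mathcal{D}_{M,N} = \Big\{ (w_1, w_2, \ldots, w_M) : w_1 = 1, 2, \ldots, \Big\lfloor \frac{N}{M} \Big\rfloor,\ w_2 = w_1, w_1 + 1, \ldots, \Big\lfloor \frac{N - w_1}{M - 1} \Big\rfloor,\ \ldots,\ w_{M-1} = w_{M-2}, \ldots, \Big\lfloor \frac{N - \sum_{i=1}^{M-2} w_i}{2} \Big\rfloor,\ w_M = N - \sum_{i=1}^{M-1} w_i \Big\},$$

where “$\lfloor \cdot \rfloor$” gives the largest integer that does not exceed a specific real number. We can define a 1-1 mapping over $\mathcal{D}_{M,N}$ such that every $w = (w_1, w_2, \ldots, w_M) \in \mathcal{D}_{M,N}$ maps to two vectors $n(w) = (n_0, n_1, \ldots, n_l)$ and $u(w) = (u_0, u_1, \ldots, u_l)$ which satisfy

$$w = (\underbrace{n_0, \ldots, n_0}_{u_0 \text{ many}}, \underbrace{n_1, \ldots, n_1}_{u_1 \text{ many}}, \ldots, \underbrace{n_l, \ldots, n_l}_{u_l \text{ many}}),$$

where $n_0 = 1 < n_1 < n_2 < \cdots < n_l$. Notice that this implies $\sum_{a=0}^{l} u_a = M$ and $\sum_{a=0}^{l} u_a n_a = N$.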
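Lemma 3

$$C_{N,M} = \sum_{n(w),\, u(w)\, :\, w \in \mathcal{D}_{M,N}} \frac{N!}{u_0!} \prod_{a=1}^{l} \frac{((2n_a - 3)!!)^{u_a}}{u_a!\, (n_a!)^{u_a}}.$$


Proof: Consider $n(w)$ and $u(w)$ of an arbitrary $w \in \mathcal{D}_{M,N}$. We have $u_a$ branches with degree $n_a$, $a = 0, 1, \ldots, l$. For each branch with degree $n_a$, we have $(2n_a - 3)!!$ different tree topologies. Notice that we do not consider the permutation among the $u_a$ branches with degree $n_a$. Thus the number of cases in which we first choose $u_1$ branches with degree $n_1$ is:

$$\begin{aligned}
& \binom{N}{n_1} \binom{N - n_1}{n_1} \cdots \binom{N - (u_1 - 1) n_1}{n_1} \frac{[(2n_1 - 3)!!]^{u_1}}{u_1!} \\
&= \frac{N!}{n_1!\,(N - n_1)!} \cdot \frac{(N - n_1)!}{n_1!\,(N - 2n_1)!} \cdots \frac{(N - (u_1 - 1) n_1)!}{n_1!\,(N - u_1 n_1)!} \cdot \frac{[(2n_1 - 3)!!]^{u_1}}{u_1!} \\
&= \frac{N!}{(n_1!)^{u_1} (N - u_1 n_1)!} \cdot \frac{[(2n_1 - 3)!!]^{u_1}}{u_1!}.
\end{aligned}$$

Therefore, considering the remaining branches, the total number of cases for $w$ is:

$$\begin{aligned}
& \frac{N!}{(n_1!)^{u_1} (N - u_1 n_1)!} \cdot \frac{[(2n_1 - 3)!!]^{u_1}}{u_1!} \cdot \frac{(N - u_1 n_1)!}{(n_2!)^{u_2} \big(N - \sum_{a=1}^{2} u_a n_a\big)!} \cdot \frac{[(2n_2 - 3)!!]^{u_2}}{u_2!} \cdots \frac{\big(N - \sum_{a=1}^{l-1} u_a n_a\big)!}{(n_l!)^{u_l} \big(N - \sum_{a=1}^{l} u_a n_a\big)!} \cdot \frac{[(2n_l - 3)!!]^{u_l}}{u_l!} \\
&= N! \cdot \frac{[(2n_1 - 3)!!]^{u_1}}{(n_1!)^{u_1} u_1!} \cdot \frac{[(2n_2 - 3)!!]^{u_2}}{(n_2!)^{u_2} u_2!} \cdots \frac{[(2n_l - 3)!!]^{u_l}}{(n_l!)^{u_l} u_l!} \cdot \frac{1}{u_0!},
\end{aligned}$$

where we used $N - \sum_{a=1}^{l} u_a n_a = u_0$. Summing over all $w \in \mathcal{D}_{M,N}$ gives $C_{N,M}$.

□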

Example 1 The following table gives the values of $C_{N,M}$ when $N \le 6$:







Distributions of topological tree metrics 




M \ N      1      2      3      4      5      6
1          1      1      3     15    105    945
2                 1      3     15    105    945
3                        1      6     45    420
4                               1     10    105
5                                      1     15
6                                             1

Take $N = 6$, $M = 3$ for example. There are 3 possible ordered branch degrees:

1. $w = (114)$, $n = (14)$, $u = (21)$, number of cases: $\frac{6!}{2!} \cdot \frac{5!!}{4!} = 225$;

2. $w = (123)$, $n = (123)$, $u = (111)$, number of cases: $\frac{6!}{1!} \cdot \frac{1!!}{2!} \cdot \frac{3!!}{3!} = 180$;

3. $w = (222)$, $n = (12)$, $u = (03)$, number of cases: $\frac{6!}{0!} \cdot \frac{(1!!)^3}{3!\,(2!)^3} = 15$.

So $C_{6,3} = 225 + 180 + 15 = 420$.
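The counts in Example 1 can be reproduced with a few lines of Python. This is a sketch under the reading of Lemma 3 used above (each multiset of branch degrees contributes $N! \prod_a [(2n_a-3)!!]^{u_a} / ((n_a!)^{u_a} u_a!)$, with $(-1)!! = 1$); the function names are hypothetical, and `count_ordered` implements the product of binomial coefficients for the case where the order of coalescences in different branches matters.

```python
from collections import Counter
from math import comb, factorial

def double_factorial_odd(n_a):
    """(2 n_a - 3)!!, with the convention (-1)!! = 1 for n_a = 1."""
    out = 1
    for odd in range(2 * n_a - 3, 1, -2):
        out *= odd
    return out

def branch_degrees(total, parts, smallest=1):
    """Yield nondecreasing tuples of `parts` positive integers summing to `total`."""
    if parts == 1:
        if total >= smallest:
            yield (total,)
        return
    for first in range(smallest, total // parts + 1):
        for rest in branch_degrees(total - first, parts - 1, first):
            yield (first,) + rest

def count_unordered(N, M):
    """C_{N,M}: order of coalescences in different branches is ignored."""
    total = 0
    for w in branch_degrees(N, M):
        numerator, denominator = factorial(N), 1
        # Counter(w) maps each distinct degree n_a to its multiplicity u_a.
        for n_a, u_a in Counter(w).items():
            numerator *= double_factorial_odd(n_a) ** u_a
            denominator *= factorial(n_a) ** u_a * factorial(u_a)
        total += numerator // denominator
    return total

def count_ordered(N, M):
    """Number of ordered coalescence sequences taking N lineages to M."""
    out = 1
    for k in range(M + 1, N + 1):
        out *= comb(k, 2)
    return out

for M in range(1, 7):
    print(M, [count_unordered(N, M) for N in range(M, 7)])
print(count_ordered(4, 2), count_unordered(4, 2))
```

The last line contrasts the two conventions for 4 lineages coalescing to 2: the two orderings of the topology 12 | 34 are counted separately by `count_ordered` and once by `count_unordered`.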

For $n$ species, $n - 1$ coalescences should happen during the coalescent times $t_1, t_2, \ldots, t_n$. Here, we call the pattern of how these coalescences (regardless of which lineages are coalescing) are distributed over the coalescent times, i.e. in which coalescent time the $k$th coalescence happens, the coalescent timeline. When the gene tree completely matches the species tree, we know that the tree topology of the gene tree is fixed, i.e. the pattern and ordering of the coalescences are fixed. This means that the only thing we need to think about is the coalescent timeline. Let us first see a simple example.

Recall: $g_{ij}(t)$ is the probability that $i$ lineages coalesce to $j$ lineages in time $t$.

Example 2 Consider 3 species. Fix the species tree to be 12 | 3. Figure 5 gives all possible gene trees based on this species tree.

Fig. 5: All possible gene trees for the fixed species tree 12 | 3 (panels include (c) 1 | 23 and (d) 12 | 3).
We can compute the probabilities of these trees as follows and verify that they sum to 1:
- Cases for Figure 5(a), Figure 5(b) and Figure 5(c):

$$\Pr((1,2) \text{ in } t_3, (12,3) \text{ in } t_3) = \Pr((1,3) \text{ in } t_3, (13,2) \text{ in } t_3) = \Pr((2,3) \text{ in } t_3, (23,1) \text{ in } t_3) = \frac{1}{C_{3,1}}\, g_{22}(t_2) = \frac{1}{3} e^{-t_2}.$$

Notice that we have $\frac{1}{C_{3,1}}$ here because all these trees share the same coalescent timeline (both coalescences happen in $t_3$), and we have $C_{3,1}$ cases in $t_3$ where 3 lineages coalesce to 1 lineage;

- Case for Figure 5(d): $\Pr((1,2) \text{ in } t_2, (12,3) \text{ in } t_3) = g_{21}(t_2) = 1 - e^{-t_2}$.

In this example, $\Pr(d(T_s, T_g) = 0) = \Pr((1,2) \text{ in } t_3, (12,3) \text{ in } t_3) + \Pr((1,2) \text{ in } t_2, (12,3) \text{ in } t_3) = 1 - \frac{2}{3} e^{-t_2}$.
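The arithmetic of Example 2 is small enough to verify directly. The sketch below tabulates the four cases for the species tree 12 | 3 as a function of $t_2$ and checks that they sum to one; the function names and dictionary keys are hypothetical labels for the cases listed above.

```python
import math

def gene_tree_probabilities(t2):
    """Probabilities of the four gene-tree cases for the species tree 12|3."""
    p_deep = math.exp(-t2) / 3.0  # (1 / C_{3,1}) * g_22(t2)
    return {
        "(1,2) in t3, (12,3) in t3": p_deep,
        "(1,3) in t3, (13,2) in t3": p_deep,
        "(2,3) in t3, (23,1) in t3": p_deep,
        "(1,2) in t2, (12,3) in t3": 1.0 - math.exp(-t2),  # g_21(t2)
    }

def prob_distance_zero(t2):
    """Pr(d(T_s, T_g) = 0): the two cases whose topology is 12|3."""
    probs = gene_tree_probabilities(t2)
    return probs["(1,2) in t3, (12,3) in t3"] + probs["(1,2) in t2, (12,3) in t3"]

for t2 in (0.1, 1.0, 5.0):
    total = sum(gene_tree_probabilities(t2).values())
    print(t2, total, prob_distance_zero(t2))
```

As $t_2$ grows, the probability of a matching topology increases from $1/3$ towards 1, in line with the closed form $1 - \frac{2}{3} e^{-t_2}$.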

Since for each coalescent timeline there is only one case that gives a gene tree which completely matches the species tree, all we need to do is enumerate the coalescent timelines and compute the probability for each of them.


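Theorem 5 For $n$ species,

$$\Pr(d(T_s, T_g) = 0) = \sum_{i_2=0}^{1} \sum_{i_3=i_2}^{2} \cdots \sum_{i_k=i_{k-1}}^{k-1} \cdots \sum_{i_{n-1}=i_{n-2}}^{n-2} \left[ \prod_{k=2}^{n-1} \frac{g_{k-i_{k-1},\,k-i_k}(t_k)}{C_{k-i_{k-1},\,k-i_k}} \right] \frac{1}{C_{n-i_{n-1},\,1}}, \qquad (5)$$

where $i_1 = 0$.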

Proof: There are several requirements when we enumerate the coalescent timelines: 1) no coalescence happens in time $t_1$; 2) if the $i$th coalescence happens in time $t_{k_i}$, then $i + 1 \le k_i \le n$; 3) if the $i$th and $j$th coalescences happen in time $t_{k_i}$ and $t_{k_j}$ respectively and $i < j$, then $k_i \le k_j$ (otherwise the gene tree will have a different tree topology from the species tree); 4) all lineages coalesce to one in time $t_n$.

In Equation 5, every choice of $(i_1, i_2, \ldots, i_{n-1})$ gives a possible coalescent timeline: $i_k$ coalescences happen before or during time $t_k$, $k = 1, 2, \ldots, n-1$, and $(n - 1 - i_{n-1})$ coalescences happen during time $t_n$. It is trivial to see that these choices enumerate all possible coalescent timelines without duplicates.

Now consider a specific $(i_1, i_2, \ldots, i_{n-1})$. Then during time $t_k$, $k = 2, 3, \ldots, n-1$, since the input has $k$ species with $i_{k-1}$ coalescences, i.e. $k - i_{k-1}$ lineages, and the output has $k$ species with $i_k$ coalescences, i.e. $k - i_k$ lineages, the probability that the gene tree completely agrees with the species tree is $\frac{g_{k-i_{k-1},\,k-i_k}(t_k)}{C_{k-i_{k-1},\,k-i_k}}$ (see the example in Figure 6). During time $t_n$, we are left with $n - i_{n-1}$ lineages and they should coalesce to one, so the probability should be $\frac{1}{C_{n-i_{n-1},\,1}}$.

□


Example 3 There are five cases for $n = 4$ in which the gene tree completely matches the species tree. We apply Theorem 5 for $n = 4$ and obtain the following probabilities for each of the cases:

1. Coalescences (1,2) in $t_4$; (12,3) in $t_4$; (123,4) in $t_4$ (see Figure 7(a)). Probability is $\frac{1}{15} g_{22}(t_2) g_{33}(t_3)$;

2. Coalescences (1,2) in $t_3$; (12,3) in $t_4$; (123,4) in $t_4$ (see Figure 7(b)). Probability is $\frac{1}{9} g_{22}(t_2) g_{32}(t_3)$;

3. Coalescences (1,2) in $t_3$; (12,3) in $t_3$; (123,4) in $t_4$ (see Figure 7(c)). Probability is $\frac{1}{3} g_{22}(t_2) g_{31}(t_3)$;

4. Coalescences (1,2) in $t_2$; (12,3) in $t_4$; (123,4) in $t_4$ (see Figure 7(d)). Probability is $\frac{1}{3} g_{21}(t_2) g_{22}(t_3)$;

5. Coalescences (1,2) in $t_2$; (12,3) in $t_3$; (123,4) in $t_4$ (see Figure 7(e)). Probability is $g_{21}(t_2) g_{21}(t_3)$.

Then we have the formula:

$$\Pr(d(T_s, T_g) = 0) = \frac{1}{15} g_{22}(t_2) g_{33}(t_3) + \frac{1}{9} g_{22}(t_2) g_{32}(t_3) + \frac{1}{3} g_{22}(t_2) g_{31}(t_3) + \frac{1}{3} g_{21}(t_2) g_{22}(t_3) + g_{21}(t_2) g_{21}(t_3).$$


By Theorem 5, if we have larger $t_k$ for $k = 1, \ldots, n$, then we have a higher probability that the species tree $T_s$ and its gene tree $T_g$ generated under the coalescent given $T_s$ have the same tree topology. In addition, since the K-IC distance is the $\ell_\infty$ norm of a vector in $\mathbb{R}^{\binom{n}{2}}$, the path difference is the $\ell_2$ norm of the vector in $\mathbb{R}^{\binom{n}{2}}$, and the edge difference is the $\ell_1$ norm of the vector in $\mathbb{R}^{\binom{n}{2}}$, the K-IC tree metric can be used as an upper bound for the path difference tree metric and the edge difference tree metric by Remark 3. Thus, by Lemmas 1 and 2, if we have larger $t_k$ for $k = 1, \ldots, n$, then the distributions of the tree distance metrics $d_e$, $d_p$ and $d_k$ between $T_s$ and $T_g$ are skewed to the right.
Fig. 6: 5 species with timeline $(i_1, i_2, i_3, i_4) = (0,0,1,3)$. $i_2 - i_1 = 0$ coalescences happened in $t_2$; $i_3 - i_2 = 1$ coalescence happened in $t_3$; $i_4 - i_3 = 2$ coalescences happened in $t_4$. In time $t_3$, we have $3 - i_2 = 3$ lineages coming in and $3 - i_3 = 2$ lineages coming out, so the probability that we get exactly the same topology as this figure during time $t_3$ is $\frac{g_{32}(t_3)}{C_{3,2}}$.


5 Simulations 


First we conducted a simulation study on the three tree distances, the edge difference, path difference, and precise K-IC distances, between two unrooted random trees with 12 leaves, similar to the study of Steel and Penny (1993) (Figure 6 in their paper). We generated 10,000 unrooted random trees with 12 leaves using the function rtree from the R package ape (Paradis et al. 2004). Then for each distance measure $d_e$, $d_p$, $d_k$ we computed a histogram. In order to compare the histograms with each other we normalized the distances so that they scale from 0 to 10. The results are shown in Figure 8. We also conducted the same simulations with the function rcoal from ape and obtained basically the same results.
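A simulation of this kind can be sketched in pure Python. The block below is not the authors' R workflow: instead of `rtree` from ape it draws trees by attaching each new leaf to a uniformly chosen edge (which gives the uniform distribution over labeled unrooted binary trees), it uses 200 tree pairs rather than 10,000 trees to stay quick, and it assumes that "scale from 0 to 10" means a linear min-max rescaling. All function names are hypothetical.

```python
import math
import random
from collections import deque
from itertools import combinations

def random_unrooted_tree(n, rng):
    """Edge list of a uniformly random unrooted binary tree, leaves 1..n."""
    edges = [(1, -1), (2, -1), (3, -1)]
    for leaf in range(4, n + 1):
        node = -(leaf - 2)  # fresh internal node
        u, w = edges.pop(rng.randrange(len(edges)))
        edges += [(u, node), (node, w), (leaf, node)]
    return edges

def path_vector(edges, n):
    """Number of edges between leaves i < j, in lexicographic order."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    dist_from = {}
    for leaf in range(1, n + 1):
        dist = {leaf: 0}
        queue = deque([leaf])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        dist_from[leaf] = dist
    return [dist_from[i][j] for i, j in combinations(range(1, n + 1), 2)]

def rescale(values, top=10.0):
    """Assumed normalization: map the observed range linearly onto [0, top]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [top * ((v - lo) / (hi - lo)) for v in values]

rng = random.Random(0)
N_LEAVES, N_PAIRS = 12, 200
samples = {"d_e": [], "d_p": [], "d_k": []}
for _ in range(N_PAIRS):
    v1 = path_vector(random_unrooted_tree(N_LEAVES, rng), N_LEAVES)
    v2 = path_vector(random_unrooted_tree(N_LEAVES, rng), N_LEAVES)
    diff = [abs(p - q) for p, q in zip(v1, v2)]
    samples["d_e"].append(sum(diff))
    samples["d_p"].append(math.sqrt(sum(x * x for x in diff)))
    samples["d_k"].append(max(diff))
scaled = {name: rescale(vals) for name, vals in samples.items()}
print({name: round(sum(vals) / len(vals), 3) for name, vals in scaled.items()})
```

The rescaled lists in `scaled` can be passed to any histogram routine to compare the shapes of the three distributions on a common axis.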

In the second part, we conducted a simulation study on the distributions of $d_e$, $d_p$, $d_k$ between the caterpillar species tree and a random gene tree generated from the coalescent process with the species tree. We used the software Mesquite (Maddison and Maddison 2011) to generate caterpillar species trees with 5, 6, 7 and 8 leaves, respectively, under the Yule process. Then we simulated 10,000 gene trees within each species tree. All the trees in the simulation have the same parameters, namely the effective population size $N_e = 30{,}000$ and species depth $= 1{,}000$. For each number of leaves, we then calculated the three different kinds of distances between the gene trees and the species tree. Table 1 shows the proportions of 0 and 1 distances in each of the three distances for the rooted trees with 5, 6, 7 and 8 leaves. Figures 9, 10 and 11 show the histograms of the three kinds of distances for trees with 5, 6, 7 and 8 leaves.


6 Discussion 

While many distance measures between trees are hard to compute (see Remark 1), the tree distances $d_e$, $d_p$, $d_k$ can be computed in polynomial time in $n$. Today, we can generate huge numbers of DNA sequences from genomes using new generation sequencing techniques, which can produce tens of millions of base pairs of DNA sequence. In order to conduct phylogenomic analysis on genome data sets we need fast tree distances, such as $d_e$, $d_p$, $d_k$. However, in order to understand statistical phylogenomic analysis on genome data sets with these tree distances, we have to understand the distributions of these distances.

In this paper we have shown some theoretical and simulation results on the distributions of the tree distances $d_e$, $d_p$, $d_k$ between unrooted random trees with $n$ leaves and between the caterpillar species tree and a random rooted gene tree with $n$ leaves generated from the coalescent process with the species tree.
Fig. 7: Gene trees with $d(T_s, T_g) = 0$ and their coalescent timelines $(i_1, i_2, i_3)$: (a) (0,0,0), (b) (0,0,1), (c) (0,0,2), (d) (0,1,1), (e) (0,1,2).







(a) A histogram of $d_e$ between two unrooted random trees with 12 leaves. We scale $d_e$ from 0.0 to 10.0 so that we can compare it to the other distance measures.

(b) A histogram of $d_p$ between two unrooted random trees with 12 leaves. We scale $d_p$ from 0.0 to 10.0 so that we can compare it to the other distance measures.

(c) A histogram of $d_k$ between two unrooted random trees with 12 leaves. We scale $d_k$ from 0.0 to 10.0 so that we can compare it to the other distance measures.

Fig. 8: We generated 10,000 random trees using the function rtree from ape.


The distributions of the tree distances $d_e$, $d_p$, $d_k$ between unrooted random trees with $n$ leaves seem to be symmetric, and we have conducted some goodness-of-fit tests against the Gaussian distribution. However, the null hypothesis (the distribution fits the Gaussian distribution) seems to be rejected (with the number of trees equal to 10,000), so it would be interesting and useful to know the asymptotic distributions of $d_e$, $d_p$, $d_k$ between unrooted random trees with $n$ leaves.

Leaves  Distance  Proportion d = 0  Proportion d = 1  Mean distance  Standard deviation
5       d_k       0.9543            0.0457            0.0457         0.2088
5       d_e       0.9543            0                 0.2742         1.2531
5       d_p       0.9543            0                 0.1119         0.5116
6       d_k       0.9007            0.0961            0.1025         0.3137
6       d_e       0.9007            0                 0.8200         2.4899
6       d_p       0.9007            0                 0.2869         0.8682
7       d_k       0.4824            0.3516            0.6842         0.7420
7       d_e       0.4824            0                 6.5920         6.7687
7       d_p       0.4824            0                 1.9531         1.9685
8       d_k       0.0760            0.2639            1.8490         0.9002
8       d_e       0.0760            0                 20.2859        9.0730
8       d_p       0.0760            0                 5.1175         2.0716

Table 1: The proportions of 0 and 1 distances in each of the three distances $d_e$, $d_p$, $d_k$ for the rooted trees with 5 leaves, 6 leaves, 7 leaves and 8 leaves.

In Theorem 5 we have shown explicitly the probability that the tree distance $d_e$, $d_p$, $d_k$ between a caterpillar species tree with $n$ leaves and a random gene tree with $n$ leaves generated by the coalescent process with the species tree equals zero. Note that here the species tree is assumed to be a caterpillar because $d_k$ between two trees can reach its upper bound only if one of them is a caterpillar. Figure 9, Figure 10 and Figure 11 show us that when the sizes of the trees get larger, the centers and variation of the non-zero distances also become larger, but zero is the only distance value that always has a positive probability for all three types of distances. We are also interested in computing the probability of $d_k$ being one, which is generally zero for $d_e$ and $d_p$ (see Table 1). However, we do not know many aspects of the tree distance $d$ (one of the distances $d_e$, $d_p$, $d_k$) between them as $n \to \infty$. Thus, we have the following questions.


Problem 1 Consider the tree distances $d_e$, $d_p$, $d_k$ between a caterpillar species tree with $n$ leaves and a random gene tree with $n$ leaves generated by the coalescent process with the species tree. What is the expectation of the tree distance $d$ (one of the distances $d_e$, $d_p$, $d_k$) between them? What is its variance? Can we say anything about the expectation asymptotically?
Fig. 9: Histogram of $d_k$ for the caterpillar species tree and a random tree generated from the coalescent process

Fig. 11: Histogram of $d_p$ for the caterpillar species tree and a random tree generated from the coalescent process